DOCUMENT RESUME

ED 228 278                                            TM 830 153

AUTHOR        Burry, James
TITLE         An Introduction to Assessment and Design in Bilingual
              Program Evaluation.
INSTITUTION   California Univ., Los Angeles. Center for the Study
              of Evaluation.
SPONS AGENCY  National Inst. of Education (ED), Washington, DC.
REPORT NO     CSE-RP-1
PUB DATE      82
NOTE          53p.; Paper presented at the Title VII Bilingual
              Education Management Institute (Los Angeles, CA,
              March 30-April 1, 1981).
PUB TYPE      Guides - Non-Classroom Use (055);
              Speeches/Conference Papers (150)
EDRS PRICE    MF01/PC03 Plus Postage.
DESCRIPTORS   Achievement Gains; *Bilingual Education Programs;
              *Data Analysis; Decision Making; Educational
              Objectives; *Evaluation Methods; *Evaluation Needs;
              Needs Assessment; Program Effectiveness; *Program
              Evaluation; *Research Design; Research Methodology;
              Test Use
IDENTIFIERS   *Elementary Secondary Education Act Title VII

ABSTRACT

The most difficult problems in bilingual education evaluation are
disagreement over what evaluation is and how it is done; the debate
over what bilingual education is and how a program is planned and
operated locally; and the nature of bilingual education itself, which
creates problems in assessment and design methodology. General
information is provided on three basic considerations in bilingual
program evaluation: assessment, evaluation design, and data analysis.
Assessment is the full range of information that might be used to
make decisions about a bilingual program, including what it
accomplishes for its students as well as the procedures to achieve
these goals. Measures of student performance and program processes,
such as interviews and observations, are examined. A brief
examination of major designs used in bilingual program evaluation
focuses on the designs that seem to be most useful and feasible in a
bilingual program setting. The most general-purpose designs for
investigating program outcomes are the time-series/longitudinal
designs. The exposure-to-treatment design is widely applicable for
formative evaluations. Accountability designs report information
about student achievement of local objectives or national norms.
Basic issues raised concern the questions an evaluation might try to
answer, collection of information, and the appropriate analytic
techniques. (CM)



***********************************************************************
*    Reproductions supplied by EDRS are the best that can be made     *
*                    from the original document.                      *
***********************************************************************




U.S. DEPARTMENT OF EDUCATION 

NATIONAL INSTITUTE OF EDUCATION 
EDUCATIONAL RESOURCES INFORMATION 

CENTER (ERIC)
This document has been reproduced as
received from the person or organization
originating it.

Minor changes have been made to improve
reproduction quality.

Points of view or opinions stated in this document
do not necessarily represent official NIE
position or policy.



"PERMISSION TO REPRODUCE THIS 
MATERIAL HAS BEEN GRANTED BY 



TO THE EDUCATIONAL RESOURCES 
INFORMATION CENTER (ERIC);" 



AN INTRODUCTION TO ASSESSMENT AND DESIGN IN 
BILINGUAL PROGRAM EVALUATION 



James Burry 



CSE Resource Paper No. 1 
1982 



CENTER FOR THE STUDY OF EVALUATION 
Graduate School of Education 
University of California, Los Angeles






The work reported herein was supported in part under a grant from
the National Institute of Education. However, the opinions
expressed do not necessarily reflect the position or policy of the
National Institute of Education and no official endorsement should
be inferred.



TABLE OF CONTENTS

INTRODUCTION

BACKGROUND CONCERNS

    General Problems in the Evaluation of Bilingual Programs

    An Approach to Evaluation

    The Nature of Bilingual Education

    Methodology Problems in Design and Testing

IMPLICATIONS AND SUGGESTIONS FOR PRACTICE

    Assessing Title VII Program Gains

    Title VII Program Components and Pupil Performance

    Describing the Program that Led to the Gains

    Selecting an Appropriate Evaluation Design and Analysis
    Technique

    The Range of Designs

    Designs Recommended for Bilingual Programs

CONCLUSION

REFERENCES






AN INTRODUCTION TO ASSESSMENT AND DESIGN IN 
BILINGUAL PROGRAM EVALUATION*



INTRODUCTION 



This paper offers general information on three basic
considerations in bilingual program evaluation: assessment,
evaluation design, and data analysis. By assessment is meant the
full range of information that might be used to make decisions about
a bilingual program, including what it accomplishes for its students
as well as the procedures it follows to achieve these goals. Covered
are measures of student performance and measures of program
processes, such as interviews and observations. A brief examination
of major designs used in bilingual program evaluation will be made,
focusing on the three or four designs that seem to be most useful
and feasible in a bilingual program setting. Basic issues will be
raised concerning questions an evaluation might try to answer,
information that might be collected to answer them, and analytic
techniques appropriate to the particular questions asked and data
collected.

These remarks are meant to serve two audiences: (1) the internal
evaluator of a bilingual program, especially in monitoring and
improving the program while in operation; and (2) the staff of a
bilingual program who want an external evaluation that is
technically sound as well as appropriate within the particular
program's constraints. This second concern involves asking the right
kinds of questions about the use of a certain kind of measure,
design, or means of analysis.

*This paper was presented at the Title VII Bilingual Education
Management Institute, Los Angeles, California, March 30-April 1,
1981.

The procedures to be discussed belong to a series of training
workshops developed at the University of California, Los Angeles
Center for the Study of Evaluation (CSE) over the last few years.
Each of the techniques, designs, and procedures is covered in
detail in one or more of these workshops. The broader rationale
behind these workshops and the suggestions in this paper have been
discussed in an earlier article (Burry, 1979).

BACKGROUND CONCERNS 

General Problems in the Evaluation of Bilingual Programs

Evaluations rarely provide the most useful information about
what bilingual programs are like or the range of outcomes and level
of success they achieve. Among the reasons for this failure are the
following:

1. Those called upon to evaluate bilingual programs, though
well versed in the general area of educational evaluation, may be
unfamiliar with the particular needs and characteristics of
bilingual education.

2. Evaluations of bilingual programs frequently provide only
general descriptions of the nature and content of the programs;
such information is of limited use to decision makers.

3. Diversities among bilingual programs are great, and the
extent to which these differences are related to differential
development across programs is still questionable. (Agreement is
needed on such basic assumptions as what constitutes a minimum
bilingual education and the kinds of achievement gains and other
outcomes to be expected of a particular program. Attempting to
measure the effects of a bilingual program without determining
whether a bilingual program actually exists or what it needs to
accomplish makes it difficult to establish valid criteria for
determining success or failure.)

4. Even assuming the development of valid criteria, evaluators
must still explore the relationships among instructional
strategies, implementation techniques, and program outcomes.

5. A shortage of adequate instruments for the assessment of
students of non- and limited English proficiency (NEP/LEP) exists.

6. A problem exists over designs appropriate to the evaluation
of bilingual programs. The fundamental question regarding these
programs has been to determine whether the scholastic achievement
of students in bilingual programs equals or excels what it would
have been had they remained in a regular, monolingual course. Two
factors greatly hinder the generation of reliable findings: (a)
bilingual programs are intended to provide adequate education for a
broad range of NEP/LEP students, and (b) methodological problems of
participant selection are likely to obscure program results.






The above comments form the background for the three most
difficult problems in bilingual education evaluation. First, there
is much disagreement over what evaluation is and how it is done.
Second, there is much debate over what bilingual education is and
how a bilingual program is planned and operated locally. Third, the
nature of bilingual education, in terms of which students it should
reach and help, creates problems of methodology, especially in the
areas of assessment and design.

An Approach to Evaluation

Following current evaluation practice, many bilingual program
evaluations appear guided by an approach that relates the methods
of evaluation to those of research. This paper will distinguish
between evaluation and research. This distinction is important,
far beyond questions of topic of interest or techniques employed,
involving fundamental differences in the kind of information
generated, how it is generated, and its intended uses. Cronbach
and Suppes (1969), in their discussion of modes of inquiry, cite
the need to:

    ...distinguish decision-oriented from
    conclusion-oriented investigations. In a
    decision-oriented study the investigator is
    asked to provide information wanted by a
    decision-maker: a school administrator, a
    governmental policymaker, the manager of a
    project... or the like. The decision-oriented
    study is a commissioned study. The decision-
    maker believes that he needs information to
    guide his actions and he poses the question to
    the investigator. The conclusion-oriented
    study, on the other hand, takes its direction
    from the investigator's commitments and
    hunches. The educational decision-maker can,
    at most, arouse the investigator's interest in
    a problem. The latter formulates his own
    question, usually a general one rather than a
    question about a particular institution. The
    aim is to conceptualize and understand the
    chosen phenomenon; a particular finding is
    only a means to that end. (pp. 20-21)

In light of this distinction, evaluation can be categorized
as decision-oriented and research as conclusion-oriented study.
Evaluation differs from research intended to produce results of
general significance; evaluation, with its decision orientation,
is intended to provide results relevant for a particular program
at a specific point in time. It differs from conclusion-oriented
research in the reasons for conducting the work, the constraints
imposed by the institutional setting, and the intended use of the
information generated. Evaluation should generate information that
will lead to decisions about educational questions and problems
within a particular setting.

The decision orientation toward evaluation considers both the
manner and the times at which evaluations might be conducted. It
points out phases in the development of a program during which
various audiences might effectively use credible information. It
does not assume there is a single approach to evaluation which is
universally appropriate. Rather, the approach to evaluation is
guided by the kind of bilingual program to be evaluated, the
context or setting in which it operates, and the real-world
constraints with which the program staff must work.

The distinction between research and evaluation should not be
interpreted to mean that research and evaluation are mutually
exclusive categories. Evaluations often employ research methodology
to enhance the generalizability of their findings. Further, many
evaluations include some research questions. For example,
hypotheses regarding the relationship between certain aspects of
the instructional program and student outcomes may be tested.
Evaluation, thus, may encompass research and is a somewhat broader
activity than classic research. Both, however, are complementary
activities; in the complex world of the evaluation of bilingual
programs, there is room for both pursuits.

From the above distinction between research and evaluation,
the Center for the Study of Evaluation has defined evaluation as
the process of selecting, collecting, and interpreting information
for the purpose of keeping various audiences informed about a
program. Usually these audiences will use the information to make
decisions. The decisions needed to be made about a program depend
on the program's stage of development. Creating a bilingual program
can be viewed as occurring in four phases:

Needs assessment. The first phase in the cycle of program
development is needs assessment, during which the need for the
program is established. In this phase students are assessed to
determine the range of English-language proficiencies that exist.
This information will determine the most appropriate kind of
bilingual program to meet students' needs. This phase will also
include evaluating the surrounding context in which the program
will have to operate, e.g., parental and community attitudes,
support for the program, and staff qualifications.

Program planning. The second phase in program development is
program planning. Here teachers, parents and other community
members, curriculum experts, and others plan a program to meet the
high-priority goals determined by the needs assessment. Plans
should also be made for evaluation of the program at this time,
including consideration of the measures to be used, the most
appropriate evaluation design, and means of analysis and
reporting.



The first two phases are concerned with establishing the
program's goals. A needs assessment sets goal priorities; program
planning describes the program to reach these goals and how the
program will be evaluated. Implicit in this model is one
prescription about how evaluations should proceed, that is, they
should address the goals and processes of the particular program
being evaluated. The evaluator should look first at the goals
and processes of the particular bilingual program and then design
an evaluation appropriate to that set of goals and program
processes. To ensure the widest use of the information gathered,
both for people in the program and for audiences external to it,
the evaluator should also plan to describe and document the
program's features while in operation.

Formative evaluation. Formative evaluation requires collecting
and sharing information for program improvement. While a program
is being installed, the formative evaluator provides program
planners and staff with implementation information to help adjust
and improve it. Eventually, of course, people will want to know
whether or not the program is effective; but these questions
cannot be answered immediately. They must await a summative
evaluation that asks about the program's overall value and effect.



While the program develops, the evaluator should provide a
wide range of program implementation information. This information
can be used to decide if the program needs to be modified and/or to
check that students are progressing.

Summative evaluation. Summative evaluation examines the
program's total impact. After its developmental stage and when
functioning as intended, the program is ready to be summarily
described and perhaps judged. This summary evaluation, the kind of
evaluation typically conducted, will include a program description
and an estimation of its effect on student outcomes.

Implicit in the four phases of program development and
evaluation is the notion of program documentation. Documentation
of bilingual programs is a crucial activity. It adds to the
evaluation component an element that fully describes the bilingual
practices followed in the program and those events and processes
that interacted to achieve outcomes. This documentation should
describe the program's crucial features and be considered an
integral part of program operation and evaluation. If worthwhile
results are found in a program, it may be worth maintaining to test
its future effects and to help others assess its relevance in a
new setting. Thus, simple statement of outcome and estimation of
effect is not adequate to fully demonstrate program achievements.
Needed is a thorough description that, in conjunction with
statements of outcome, will document the interactions of process
and events that constituted the program.

The approach to evaluation described above permits
consideration of a broad range of important concerns: context
characteristics such as class size, school/district features, and
the program's time frame; input characteristics such as student
age, linguistic background, ability, and attitude; implementation
characteristics such as program features and need for
improvement; and outcomes, including both anticipated and
unanticipated effects. It focuses data collection where program
effects are most likely to occur and gathers a wide variety of
information, both process and outcome. Above all, it attempts to
provide credible and useful information for a broad range of
educational decisions.

The Nature of Bilingual Education 

What bilingual education is, or should be, depends on who you
ask. Some feel that bilingual education should be transitional;
others feel it should be maintenance. Cultural, psychological,
social, political, and educational issues must be considered:
basic issues of language dominance versus language proficiency and
differing theories of learning, development, language acquisition,
and linguistics.

Four major kinds of bilingual programs exist:

Type I, Transitional Bilingual, allows pupils to adjust to
school and/or subject matter until English skills are developed.
In this kind of program, language development is the objective.

Type II, Monoliterate Bilingual, works to develop both
languages for aural/oral skills but doesn't include literacy skills
in the native language. This kind of program develops fluency in
the native language and is a compromise between language shift and
language maintenance.

Type III, Partial Bilingualism, aims for fluency and literacy
in both languages, but literacy in the native language is
restricted to certain subject matters. This kind of program
emphasizes language and cultural maintenance.

Type IV, Full Bilingualism, develops skills in both languages.
This kind of program is directed at development and maintenance of
the native language.

Given these differences in program intention, the overall
evaluation and the procedures it follows should be appropriate to
the nature and purposes of the program being evaluated.




Methodology Problems in Design and Testing 



The minimum requirements for Title VII basic programs as
amended by the regulations published in the Federal Register,
March 29, 1979, include:

1. The evaluation should assess project progress in achieving
   its Title VII objectives in all components.

2. Evaluation of pupil achievement should provide for some
   comparison of performance on reading skills in English
   and in the native language, with an estimate of what
   performance would have been in the absence of the program.
   Comparisons may be based on local, regional, or national
   norms on standardized tests; on historical data; or on
   achievement scores of a comparison or control group. The
   limitations of such comparisons must be recognized in
   data analysis and interpretation.

3. Instruments to evaluate student performance should be
   described, as should the rationale for their selection
   and procedures for use.

4. Pre- and posttest results should be reported on all
   participating students (and comparison students, if used)
   using means, standard deviations, and appropriate tests
   of statistical significance.

5. Procedures should be established for determining when
   students no longer need assistance in developing English
   proficiency; this includes an individual evaluation of
   each student enrolled in the program for two years to
   determine if the student should remain in the program.
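
The reporting called for in requirement 4 can be illustrated with a
small sketch. The Python below is only an illustration under stated
assumptions: the scores are invented, the function name
summarize_pre_post is hypothetical, and a paired t statistic is used
as one possible "appropriate test"; the regulations quoted above do
not prescribe a particular test.

```python
import math
import statistics


def summarize_pre_post(pre, post):
    """Return means, standard deviations, and a paired t statistic.

    pre and post are equal-length lists of scores for the same pupils,
    listed in the same order (an assumed data layout).
    """
    if len(pre) != len(post) or len(pre) < 2:
        raise ValueError("need matched pre/post scores for at least two pupils")
    gains = [after - before for before, after in zip(pre, post)]
    n = len(gains)
    mean_gain = statistics.mean(gains)
    sd_gain = statistics.stdev(gains)  # sample standard deviation of the gains
    # Paired t statistic: mean gain divided by its standard error.
    # Left as None when every pupil gained exactly the same amount.
    t = mean_gain / (sd_gain / math.sqrt(n)) if sd_gain > 0 else None
    return {
        "n": n,
        "pre_mean": statistics.mean(pre),
        "pre_sd": statistics.stdev(pre),
        "post_mean": statistics.mean(post),
        "post_sd": statistics.stdev(post),
        "mean_gain": mean_gain,
        "t": t,
        "df": n - 1,
    }


# Invented scores for five pupils, for illustration only.
result = summarize_pre_post([10, 12, 9, 14, 11], [13, 15, 10, 18, 12])
```

The t value and its degrees of freedom would still need to be
compared with a t table, and the result says nothing by itself about
what performance would have been without the program; that depends on
the standard of comparison discussed next.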

A major consideration in planning a bilingual program
evaluation is the establishment of an appropriate standard of
comparison and the selection or development of measures appropriate
to the desired means of comparison. These decisions should be
influenced by the particular program's features and relevant
federal, state, and local requirements. While becoming familiar
with the program, the evaluator must keep in mind the basic
requirements of an acceptable evaluation and the design and testing
issues associated with them.

Federal regulations that provide for transitional bilingual
programs call for comparisons based on local, regional, or
national norms; on historical data; or on achievement scores of a
comparison or control group. It is generally impossible to
establish a control group to provide a close estimate of how well
students would have done without the program. However, the
evaluator should attempt to make the best estimate available within
the particular program's constraints, noting any limitations
stemming from the comparison used as the estimate's basis.

Once the standard of comparison has been established, tests
and other measures appropriate to that type of comparison can be
selected or developed. Tests, whether selected or developed,
should possess acceptable technical properties and match the
program's curriculum and the language of instruction. Given the
current limitations of both norm-referenced and criterion-referenced
tests, and because all aspects of a bilingual project cannot be
assessed by achievement measures, a broad data base should be
developed. This base may rely on standardized and
criterion-referenced tests as well as other measurement techniques
such as questionnaires and observations. A broad data base, in
addition to serving the need of program documentation, may help to
alleviate some of the more troublesome problems discussed below.

In general, program evaluations are conducted to provide
evidence of a program's success. They try to answer the basic
question, "Did the students in bilingual programs achieve as well
or better than if they had remained in a regular, unilingual
program?"

As mentioned earlier, two factors pose design problems for
evaluating bilingual programs: (1) the broad range of NEP/LEP
students, and (2) the methodological problems of participant
selection.

Many designs have been proposed based on comparison groups.
Yet bilingual programs can rarely be studied experimentally
because legislation or program administrators may prohibit
withholding bilingual services from students entitled to them.
Because of legal and ethical considerations, students who might
benefit from bilingual services cannot be randomly assigned to a
non-treatment comparison group. Further, bilingual programs do
not consist of a single isolated treatment that can be evaluated
experimentally. Many employ multiple compensatory efforts that
overlap, so student progress might be due to any combination of
these treatments.

For these reasons, evaluations frequently use other than
randomized controlled experimental designs. Since these
nonexperimental comparisons are confounded in various ways,
reliable baseline data will provide information that helps the
evaluator determine whether or not a program has significantly
affected student performance. Gathering baseline data requires
collecting information prior to the program's implementation and
comparing that data to the program's results. For example, it may
be necessary to consider sources of student ability differences.
Issues to explore include the student's initial linguistic status,
the period in the student's development in which the second
language is introduced, and the length of time the student has been
in the program. Such factors could either be controlled when
evaluating program impact or examined as to the effect they have on
the bilingual student's education.

Initial language dominance and/or proficiency must be
considered when evaluating a program's effects. Language dominance
and proficiency are often the primary criteria for selecting
students to participate in bilingual programs. In addition to its
application in student placement, initial diagnosis of language
skill is also important for understanding test relevance and
interpretation of subsequent test performance. Previous bilingual
evaluations have revealed that the continuum of relative fluency in
both English and the native language can be extremely large among
bilingual program students. Thus, proper generalizability about the
students served depends on an approximate account of the
heterogeneity of the dual language skills present in programs. By
contrasting similar language groups, it may be possible to provide
much more detailed information about program effect. By estimating
students' educational gains in terms of their particular
linguistic groups, the gain derived is also likely to be more
reliable.
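
Estimating gains separately for each linguistic group can be
sketched as follows. This is a minimal illustration, not a
prescribed procedure: the group labels, the scores, and the function
name mean_gain_by_group are all hypothetical, and a real evaluation
would also report group sizes and variability.

```python
from collections import defaultdict
import statistics


def mean_gain_by_group(records):
    """Average the pre-to-post gain within each language group.

    records is an iterable of (language_group, pretest, posttest)
    tuples; the grouping labels are assumed to come from an initial
    diagnosis of language dominance or proficiency.
    """
    gains = defaultdict(list)
    for group, pre, post in records:
        gains[group].append(post - pre)
    return {group: statistics.mean(values) for group, values in gains.items()}


# Invented records for illustration only.
records = [
    ("native-language dominant", 8, 14),
    ("native-language dominant", 10, 14),
    ("balanced", 15, 18),
    ("balanced", 17, 18),
]
group_gains = mean_gain_by_group(records)
```

Pooling these four pupils would give a single average gain of 3.5
points, which describes neither group well; the grouped figures keep
the heterogeneity visible.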

The appropriate time for introduction of a second language,
of reading, and the length of time students will be in a program
before final assessment are issues that may demand longitudinal
(more than a one- or two-year time span) evaluation. Longitudinal
studies are critical to determining how two languages should be
used within an entire program curriculum and what ages are most
conducive to language learning and achievement. Because of the
number of variables involved, bilingual program evaluation should
be based on: (1) multiple questions (not just whether students
educationally benefited more by program participation as opposed
to remaining in regular classes), and (2) multiple analytic
designs (rather than the traditional cross-sectional design
associated with end-of-year summative evaluations).



Beyond the question of design, lack of adequate instruments
for the assessment of NEP/LEP students presents another problem
for evaluators. Much has been said about the inadequacy of
standardized tests for assessment of culturally and linguistically
different students. These limitations are magnified when norms
are used, since NEP/LEP students are generally underrepresented in
the field testing and norming sample and separate norms for these
groups are not available. Moreover, even when adequate field
testing has been conducted among NEP/LEP students, the possible
linguistic and cultural biases of many standardized instruments
undermine their validity and reliability. The problems are
particularly acute with respect to English language measures, but
are often equally pervasive in instruments that are simply
translations from English-language versions.

There is a serious shortage of technically sound instruments
that are culturally and programmatically appropriate for non- and
limited English speakers. Test scores obtained on instruments
currently in use can vary considerably based on the congruency
between the curriculum and test content.

An alternative to standardized tests is the use of
criterion-referenced tests. These may potentially indicate the
extent to which students have mastered their instructional
program's objectives. They may serve particularly well for
diagnostic and prescriptive information at the local level. But
criterion-referenced tests also have limitations. Sparse technical
information on them is available, and how appropriate it is to use
them across projects is questionable. Further, because of the
varying levels of difficulty of items and objectives, it is
difficult to compare an achievement score on one test to that of
another. The preceding design and testing issues should be kept in
mind while the evaluator examines the program's characteristics.



IMPLICATIONS AND SUGGESTIONS FOR PRACTICE

Assessing Title VII Program Gains

Title VII evaluations often intend to provide information for
use at local, state, and federal policy levels; and they are often
more concerned with broad accountability than with program-specific
information. In addition, Title VII audiences tend to ask for
information that can be used to compare students in the program
with other students. Because of accountability and comparison
concerns, Title VII evaluations often rely on the collection and
analysis of pupils' scores on a norm-referenced test. This kind of
test has a specific purpose and measures certain kinds of goals,
and its use as the sole measure for Title VII program evaluations
is subject to question.

A norm-referenced test draws its content from a general body
of subject matter. It is designed to give information that can be
used to compare a pupil's performance with the performance of
other pupils assessed on the same test. It demonstrates the
differences among pupils on the basis of their test scores
rather than showing exactly what they have learned.

A norm-referenced test can compare pupils because it is used
to place the pupils on a scale of performance from highest to
lowest on the basis of their test scores. Because a norm-referenced
test ranks students, it might not accurately show what a Title VII
program has actually accomplished for the participating pupils.
To rank students on the basis of their test scores, pupils' scores
must be changed to another kind of score, such as a percentile.
This kind of score does not show how much, or how much of what, a
pupil has learned because the test it comes from measures general
goals instead of specific program objectives.
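
The conversion of a raw score to a percentile can be sketched as
below. This is an illustration only: the norming scores are
invented, the function name percentile_rank is hypothetical, and the
definition used (scores below, plus half of the scores equal) is one
common convention rather than the procedure of any particular test
publisher.

```python
def percentile_rank(raw_score, norm_scores):
    """Percentile rank of raw_score within a norming sample.

    Counts the share of norming scores below the raw score plus half
    of those exactly equal to it (one common convention).
    """
    if not norm_scores:
        raise ValueError("norming sample is empty")
    below = sum(1 for score in norm_scores if score < raw_score)
    equal = sum(1 for score in norm_scores if score == raw_score)
    return 100.0 * (below + 0.5 * equal) / len(norm_scores)


# Invented norming sample of ten raw scores, for illustration only.
norm_sample = [20, 25, 25, 30, 35, 40, 45, 50, 55, 60]
pretest_rank = percentile_rank(30, norm_sample)
posttest_rank = percentile_rank(35, norm_sample)
```

In this invented case the pupil moves from the 35th to the 45th
percentile, but the two numbers only locate the pupil relative to the
norming sample; they do not say which program objectives were
learned.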

Because a norm-referenced test covers broad goals, it may not
adequately measure a school's Title VII program's particular
instructional objectives. If the validity of a test can be
challenged, then statements about its reliability or consistency
have little meaning.

Due to these problems, criterion-referenced tests have been
proposed instead of, or in association with, norm-referenced
tests. A criterion-referenced, or objectives-based, test is not
meant to rank or compare pupils on the basis of their test scores.
Instead, it is intended to show how much pupils have learned in
terms of the specific objectives of the program they are in. This
kind of test often has a standard of performance or cut-off score
used for making decisions about the individual pupil's learning,
independent of what the other pupils have learned.

While criterion-referenced tests are intended to provide specific
performance information about individual pupils, they should not
be seen as a cure for all testing problems. Some issues must still
be solved in the development and use of these tests. For example,
a criterion-referenced test might not provide a base that can be
used to interpret what pupil achievement of specific objectives
actually means. That is, on the basis of the test scores alone, it
might be difficult to judge the educational significance of
attaining a set of instructional objectives, to decide if that
attainment is important or trivial. Methods to overcome this
problem will be covered later in this paper.

Norm-referenced tests, beyond any general limitations, may be
even more limited when used to evaluate educational programs like
Title VII. A norm-referenced test is usually tried out with a
national sample of students to see how well it functions. These
students are called the norming sample because their test scores
are used to interpret the scores of other students. Since pupils
who are eligible for Title VII programs differ considerably from
the norming sample, comparing them with a national sample is not
the best way to determine the quality of their performance.

There are other problems in the use of norm-referenced tests for
Title VII program evaluations. Title VII evaluations usually try
to state what the program accomplished for its pupils. One way to
make this kind of statement is to compare the performance of the
pupils in the program with identical pupils who did not get the
program. But, as mentioned earlier, establishing this kind of
comparison group, consisting of students who are eligible for the
program but do not get it, is difficult. Thus, for Title VII
evaluations, where it may only be possible to get information on
participating pupils, the need is for a method of obtaining
information interpreted in terms of pupil progress compared to
national student standards.




The two most frequently used strategies for establishing
comparison on the basis of norm-referenced test scores have
problems because of the assumptions they make about Title VII
pupil learning. Proponents of one approach believe that pupil
learning, rather than being cumulative, will follow a straight
line. If a pupil's score on the pretest is known, that pupil's
posttest score can be estimated. If a pupil scores higher than was
estimated on the posttest, then the program is said to be
successful. With the other approach, the assumption is that if a
pupil scores at a certain level on the pretest, at a certain
percentile, for example, then that pupil will score at the same
level unless educated in a Title VII program. If pupils score
higher on the posttest, then the program is presumed to have been
beneficial.

It is not clear whether the first procedure provides information
that will make the program look better or worse than it actually
was. But the second procedure may make the program look as if it
accomplished less than it did. This is because pupils in programs
like Title VII will not score the same on pretest and posttest if
they do not get the program. Rather, they tend to fall further and
further behind and score even lower on the posttest unless they
are provided with additional instruction. So if a Title VII
program simply preserves a pupil's relative standing from pretest
to posttest, the program may have been more effective than the
scores would suggest since the pupils did not fall further behind.
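
The two strategies can be sketched in a few lines to show where the second one may understate program effects. Everything below is an assumption made for illustration: the function names, the growth rate, and all of the scores and percentiles are hypothetical and are not drawn from any Title VII evaluation.

```python
def linear_expected_posttest(pretest, gain_per_month, months):
    """First strategy: project the posttest along a straight line."""
    return pretest + gain_per_month * months


def judged_successful(observed, expected):
    """Both strategies call the program successful when the observed
    posttest result is higher than the expected one."""
    return observed > expected


# First strategy, with a hypothetical pupil and growth rate.
expected_score = linear_expected_posttest(40, gain_per_month=1.0, months=8)
print(judged_successful(50, expected_score))

# Second strategy: the pupil is expected to hold the pretest percentile.
pretest_percentile = 20
posttest_percentile = 20
print(judged_successful(posttest_percentile, pretest_percentile))

# If similar pupils without the program would have slipped to a lower
# percentile (15 is an assumed figure), the same result looks different.
assumed_percentile_without_program = 15
print(judged_successful(posttest_percentile, assumed_percentile_without_program))
```

Under the "same percentile" expectation, a pupil who merely holds position is counted as showing no benefit, even though holding position may itself reflect a program effect.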

Both approaches also assume that norm-referenced tests are
accurate for measuring the effects of specific instructional
programs. This assumption is questionable and suggests the
possibility of using criterion-referenced tests, even though this
kind of test could still be improved.

To summarize to this point, there are problems in the use of
norm-referenced tests for Title VII evaluations and there are
difficulties in setting reasonable performance standards for the
pupils in these programs; so it is hard to judge the educational
significance of the gains accomplished by a given program.
Therefore, there is a need for further work on: (1) the
identification and use of tests that are accurate for measuring
the effects of Title VII programs, and (2) setting standards of
performance appropriate for students eligible for these programs.

These problems highlight the need for a broader approach to
evaluation that uses not only achievement tests but also other
techniques such as observations, interviews, questionnaires, and
other information that can be used to see what the program
accomplished. These kinds of measures can be used to provide a
background of information for interpreting what the test scores
actually mean, whether these tests are norm- or
criterion-referenced. This expanded approach to evaluation is
necessary not only for the interpretation it offers but also
because of the contribution it can make to help determine exactly
which components of the Title VII program are most effective.


Title VII Program Components and Pupil Performance 

From the above, it should be clear that Title VII evaluations can
be improved if information is provided that will help interpret
the meaning of pupil scores. Such evaluations will be collecting
information for decisions about the worth of the local program.
The local program will essentially be one version of a larger
externally-funded program since each local school site often
decides what its program will look like and what it will do to
achieve its objectives. Because of variation from school to
school, it is difficult to find out which kinds of instructional
materials and strategies lead to pupil gains.

Local school district Title VII programs are not all the same,
since each program tries to be most beneficial for its particular
pupils. Evaluation should therefore help clarify local program
intentions and operations, help operators understand the program
they have implemented, and show whether it is proceeding on
target or needs to be improved. Using a broad set of measures is
appropriate to this kind of assistance, if these measures are
selected or developed so that they match the educational
objectives of the local school program and fit the setting in
which it operates.

For example, since schools develop their own objectives and
expectations, the accuracy of a norm-referenced test for a given
school's program must be questioned. It is possible that the test
used is more or less appropriate for one program compared to
another, and so there are questions about the accuracy of the test
scores provided and what they mean for the individual school. If
one school's objectives match better with what the test measures
than those of other schools, this school will possibly score
higher on the basis of this match, no matter what takes place in
its classrooms. If a school's objectives do not match the test
well, this school's test scores would possibly be low even though
it was benefiting the pupils.

Each school planning and developing its own instructional program
may rely on a wide variety of materials and strategies.
Therefore, it is important to know how well a certain set of
instructional practices fits the test used as the principal means
of evaluation. Schools can also differ in how and how much they
use resource teachers, aides, teaching assistants, and volunteers.
The extent to which such resources can affect program impact needs
to be verified; their effect cannot be determined on the basis of
test scores alone.

In short, current evaluation practice makes it difficult to
determine the degree of pupil gains and which particular school
efforts led to them. Evaluation practice must be broadened if we
are to judge the educational significance of the gains growing out
of Title VII programs. New techniques for describing and
documenting local school practices must be explored so that we can
determine which specific Title VII instructional features succeed
and why.

No matter which test or kind of test is used for assessing Title
VII pupil outcomes, the first evaluation priority is a careful
analysis of the school's Title VII curriculum. This information
can help determine the match between the school's curriculum and
the objectives the test measured. From this, over time,
descriptions of Title VII instructional features and the kinds of
educational gains they lead to can be developed.



At the same time, accurate and cost-effective means of
documenting or describing Title VII instructional practice at the
school and classroom levels should be investigated. By thoroughly
describing the total instructional setting, a picture will emerge
of the specific school/classroom objectives; how these objectives
were established; instructional materials and practices used;
kinds of students for whom the instruction is most effective;
setting and manner in which instruction and assessment take place;
extent to which instructional practice was modified on the basis
of ongoing use of assessment information; kinds of modifications
they led to; and manner and extent of use of human and material
resources. In turn, identification of successful and unsuccessful
Title VII program operation will be enhanced.

Given the previously described problems with norm-referenced
tests, use of such tests should be subject to further
investigation. Should a norm-referenced test be used, it should be
accurately established how well it fits with the instructional
objectives and practices of the program evaluated. Scores obtained
would then be interpreted in terms of how well the test measures
the individual school's objectives.

Because a norm-referenced test is intended to show the total
picture, it might best be used on some sampling basis. That is,
not every Title VII pupil needs to be tested on a norm-referenced
test for the broader picture to emerge. Sampling is attractive in
that it will reduce the burden of testing time. It will also
provide a background for comparison that will enhance possible
Title VII use of criterion-referenced tests, which will assess
specific instructional objectives instead of the broad program
features.

A good criterion-referenced test should be tied in to a specific
set of educational objectives. As with norm-referenced tests, the
fit between the criterion-referenced test and the program
objectives and instructional practices must be determined. This
fit, incidentally, may involve not only changing the test, but
also restating school-level objectives where necessary.

Over time, these criterion-referenced tests should provide a more
precise picture of pupil achievement in terms of specific
instructional objectives rather than broad educational goals. A
norm-referenced test, however, might continue to provide part of
the information, especially that which looks at the program as a
whole, with a criterion-referenced test providing information
about the specific components and objectives of the school or
project.

Since some audiences will continue to ask for ways to interpret
criterion-referenced test scores so that they provide some
normative or comparison basis, it may be necessary to devise
methods of equating criterion-referenced scores with
norm-referenced scores. With this kind of strategy,
norm-referenced and criterion-referenced scores are part of the
overall evaluation effort; and since two kinds of pupil scores
will be available, the possibility of more meaningful assessment
is increased. This technique helps overcome the problem of
interpreting the meaning of criterion-referenced test scores
mentioned earlier.




The strategies suggested above can reduce the amount of testing
in Title VII schools while increasing its decision-making value
for different audience levels and, consequently, provide a more
accurate demonstration of the impact of the program on the
participating pupils.

In Baker et al., 1980, procedures were described by which
project staff can examine/select norm-referenced tests in terms of
their technical properties and their match with the program's
objectives. The paper also discusses procedures for
developing/selecting criterion-referenced tests.

Describing the Program that Led to the Gains 

In the procedures described above, information on what students
actually received in the program must be collected so that
statements can be made about what parts of the program led to
which outcomes. A documentation system is needed that will provide
valid and reliable information about program implementation. The
three basic approaches to documentation consist of information
gathered directly from program participants, examination of
program records, and observation. Information gathered from
program participants can consist of staff reports, questionnaires,
and interviews. Examination of records can draw either on
record-keeping systems designed specifically for the program's
documentation or on records that naturally evolve during the life
of the program. Observations, which can be informal or systematic,
usually take the form of checklists, coded reports, or delayed
reports.




These kinds of documentation procedures, in conjunction with
accurate assessment of pupil performance, should lead to Title VII
program evaluations that offer an accurate picture of the
program's achievement, the educational significance of that
achievement, and the particular program components contributing to
achievement. This approach obviously relies upon the use of
multiple measures and data-gathering techniques that let us
assemble converging data. To the extent that it is appropriate and
feasible, a combination of interviews, records, and observations
can be used to generate information that supports or qualifies the
picture of the program gained by each single approach.

Elsewhere (Burry, 1981), I have described the pros and cons of
each of the above techniques, how they tie in with the larger
evaluation effort, and the program situations in which one
technique is more appropriate than another. I also offer some tips
for designing, constructing, and testing each of the documentation
techniques.

Selecting an Appropriate Evaluation Design 1 
and Analysis Technique 

An evaluation design describes from whom evaluation information
will be collected, with what measuring device, and at what times
in the life of the program. When an evaluator plans a design, he
or she has to decide on: (1) groups or units from whom information
will be collected, (2) measurement instruments to be used, (3)
times when instruments will be administered, and (4) appropriate
procedures for analyzing data.

1 The suggestions for design are adapted from a workshop (Winters
et al., 1960) dealing with the planning, design, and conduct
of an active bilingual program evaluation.

Evaluation design is a management tool for organizing data
collection activities. The design specifies the questions to be
answered by the evaluation as well as the information that best
addresses these questions and incorporates techniques to make this
information as credible as possible, given constraints imposed by
setting or time. "Design" does not only refer to the statistics
used for data analysis, although data analysis procedures do
affect the credibility of evaluation results. Rather, design is a
term for all the organizing activities related to information
gathering: asking the right questions, identifying the information
that will answer them, selecting or developing appropriate
instruments, and applying appropriate data analysis procedures.
Part of the analysis consists of interpreting the information,
drawing conclusions, and making recommendations. The evaluation
design should pay attention to the political context in which the
program operates, the theories and assumptions of the program
participants, and the program itself. It is essential to know what
kinds of data gathering activities, from observation and
interviews through achievement tests, will yield information most
relevant to the questions guiding an evaluation. Also necessary
for the credibility of the evaluation is a familiarity with
alternative data analysis procedures and the inferences they
support.

The need for credible information arises in several kinds of
bilingual evaluations. As mentioned earlier, decision makers may
be interested in program context, inputs, implementation, or
outcomes. The courts may be concerned with whether the program
promotes effective participation in the educational process.
Funding agencies need reliable data related to funds allocation
(program inputs), program installation (implementation), and
student achievement (outcomes). School administrators are
concerned with student achievement and effective allocation of
educational resources. Project staff, in addition to having the
above concerns, also need information for use in on-going program
improvement. Parents, whose participation is required in the
planning and implementation of bilingual programs, are concerned
with student outcomes and the school environment.

Each of the four evaluation concerns (context, input,
implementation, and outcomes) generates different sets of
questions to guide the evaluation. If decision-makers need
information about program context, they may have some of the
following questions in mind:

1. Why was the program installed?

2. Who is interested in the program and why?

3. What are staff and community attitudes toward the
program?

4. What bilingual education theories guide the program, and
do staff members concur in their theories?

These questions get at the "climate" in which the program is oper- 
ating, possible conflicting interests in program results, and rel- 
evant conditions existing prior to program implementation. 

Funding agencies, administrators, and project staff often want to
know about the quality and quantity of program inputs in order to
assess the level of program implementation or the relationship of
these inputs to program outcomes. Some input questions are:

1. Which languages are spoken by students and how well?

2. What is the home language used by students?

3. How many bilingual/ESL teachers are available, and what
is their competence to teach in bilingual programs?

4. What kinds of materials are used?

5. How much money is spent on salaries, materials, aides,
etc.?

These questions deal with the kinds and amounts of resources used
in the program.

Program implementation is of special concern to the evaluator.
One must know that the program really exists, what its goals are,
and what it looks like before data collection can be planned.
Program implementation data are also very useful for
substantiating theories of bilingual instruction, assessing future
program needs, and examining the relationship between program
participation and student achievement. Implementation evaluation
focuses on such questions as:

1. Is there a bilingual program in operation? 

2. What are its major features? 

3. How much time is being spent in various activities? 

4. What are teachers' preferred teaching styles?

5. How are materials being used?

6. What are the patterns of teacher-student language use?

7. Is the program complying with legal guidelines?

8. Does the program need to be improved and, if so, how?






A final evaluation focus is on program outcomes, or summative
evaluation. This perspective, as I mentioned earlier, is perhaps
the most familiar and certainly the one that often comes to mind
when you think of "evaluation." The focal point of outcomes
evaluation is on student achievement. As seen earlier, federal
regulations require student achievement data. Evaluators, then,
may be requested to provide information about:

1. How well the program promotes student achievement
vis-a-vis a norm group.

2. How students in the program compare to those in a similar
bilingual program.

3. How students compare to students receiving no special
bilingual instruction.

4. The pattern of student achievement over time.

To summarize, bilingual evaluators need designs that provide
credible information for a wide range of questions. A variety of
information should be gathered in bilingual program evaluations
because there are many diverse audiences with interests in program
processes and outcomes. There is little baseline information
available for decision makers to know what the program should be
when fully implemented, what desirable and normal patterns of
language growth in the native language and English should be, and
what instructional strategies and materials are most positively
associated with desired student outcomes. In addition, different
audiences have different information needs. The bilingual program
evaluator, therefore, must provide a variety of background,
descriptive, process, and outcome information to augment
subsequent program planning and evaluation efforts.




Differences between bilingual and monolingual evaluations and the
threats to information validity posed by sample, instruments, and
extraneous factors in bilingual settings might suggest that it is
impossible to conduct a credible bilingual evaluation. But there
are three powerful concepts that help counter many threats to
validity. These concepts are: comparison, controlled assignment,
and multiple designs.

Comparison. One way to avoid many of the threats to validity is
to gather data for making comparisons between the program group
and some external standard. The notion of comparison is powerful
and deals with the threats to validity contained in sample,
instrumentation, and extraneous factors.

Under ideal circumstances, the evaluator could find a group of
students who were alike in all relevant educational
characteristics, such as abilities and parental and home
characteristics, assign a sample to a special program, or
"treatment," while other samples would continue in the regular
educational program. For example, the group under study may
participate in a new program designed to improve reading
comprehension. Other than participation in the new reading
program, everything about the two groups would be the same.

The evaluator might ask if students in the reading program show
greater growth in reading comprehension than students in the
regular program. To answer this question, the evaluator might
give both groups a reading comprehension test at the beginning of
the year and then administer the test again at the year's end. If
the control students' scores average 23 on the pretest and 29 on
the posttest, and the reading program students' scores went from
[...] to 40 on the posttest, the differences in scores between the
two groups would be evidence for the reading program's value.

It is, however, never possible to find identical groups of
students who differ only with respect to a single educational
treatment. Yet if judgments about the relative performances of
two groups are to be fair, the groups must be as similar as
possible. One way to enhance the comparability of the groups is
via controlled assignment.

Controlled assignment. Since evaluators cannot really find groups
with identical characteristics, they may use random assignment to
ensure that both groups have similar characteristics and resemble,
with only chance variations, the general diversity commonly
occurring in the population they represent. The notion behind
random assignment is that any member of the population (such as
second graders in a particular school) has an equal chance to be
selected for one of the groups (either the control or the
treatment group).
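
A minimal sketch of random assignment follows, assuming a simple roster of pupil identifiers and an even split into two groups. The function name, the roster, and the seed value are hypothetical, and a real evaluation would need to respect whatever eligibility and service requirements apply.

```python
import random


def randomly_assign(pupils, seed=None):
    """Shuffle the roster and split it in half, so that every pupil
    has the same chance of landing in either group."""
    rng = random.Random(seed)  # a seed makes the split reproducible
    shuffled = list(pupils)    # copy so the caller's roster is untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


# Hypothetical roster of twenty second graders.
roster = [f"pupil_{i:02d}" for i in range(1, 21)]
treatment, control = randomly_assign(roster, seed=1981)
print(len(treatment), len(control))
```

Recording the seed, or the resulting lists, documents the process by which the groups were formed, which is what makes later differences between them interpretable.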

Randomization in a school setting is not easily achieved.
Students are not ordinarily assigned to classrooms randomly. Even
within classrooms, they may be assigned to special treatments on
the basis of need rather than randomization. The situation is
further complicated in most bilingual programs if all students
eligible for the program must be served, effectively precluding
use of a control or comparison group.

But the notion of controlled assignment does address the problem
of finding an appropriate comparison group for the evaluation. If
the process by which groups are formed is known, for example, then
any initial differences between the groups can be controlled or
explained; here the evaluator can use naturally occurring
equivalent groups for comparison. An example of naturally
occurring equivalent control groups is a comparison between
students enrolled in a bilingual program in one school and
students enrolled in a similar bilingual program in another school
in the same district (note that the similarity addresses both
students and their program). Or, you might find an "equivalent"
comparison group by locating another classroom in another school
or district whose students have the same language proficiency
levels, native language background, and socio-economic status as
those enrolled in the bilingual program. Randomization, matching,
and selection of comparison groups on the basis of a cutting score
are all instances of controlled assignment. Any controlled
assignment procedure can be applied to different kinds of groups:
to students, to classrooms (without regard to how the students in
them got there), to schools, or to districts. If the process by
which the comparison groups were formed is known, and the
potential systematic differences between them are documented
(using the kinds of procedures mentioned earlier), you will be
able to reach fairly good conclusions about student achievement
based on these comparisons.

Multiple designs. The third concept for improvement of design is
the use of multiple designs. The information needs of bilingual
programs are extensive, and they require several kinds of
information that probably cannot be addressed in any single
design. Since design refers to preselected units of observation,
measures, the timeline for data collection, and data analysis
plans, it is unlikely that any one particular combination of
timeline, sample, instruments, and analysis would be appropriate
for answering several different evaluation questions. For example,
the evaluation design for gathering information on the question,
"Is the program being implemented as planned?" will differ from
one that asks "How well are students who learn to read in their
native language doing when compared to those instructed only in
English?" In the first instance, it will be necessary to get
observation and questionnaire data about teachers collected at
frequent intervals. The data analysis will consist, in part, in
matching what was discovered happening in the program to the
official course description as stated in the funding proposal
and/or course of study. Needed for the second question, which may
be asked during the same evaluation, will be achievement test data
on English reading and reading in the native language as well as
some of the descriptive information described above. These data
will probably be collected at the beginning and end of the year
from students enrolled in the bilingual program as well as NEP/LEP
students not receiving reading instruction in their native
languages. Data will be analyzed by testing for significant
differences between group scores on the posttest.

Several data collection plans can be incorporated into the
evaluation, each chosen to collect information from a particular
group by a particular method at a specified time. Some of these
designs may include random assignment within the program in order
to answer research questions. Other designs will involve neither
comparison nor controlled assignment because they are intended to
gather information to be used only within the confines of the
particular program.

In planning the evaluation, one must consider what will be asked
about the program. Will comparisons need to be made between
program students and students not in a bilingual classroom?
Between program students and a norm group? Must the trend of
language acquisition be shown or the parts of the program
contributing to student achievement be identified? Is it necessary
to know how people feel about the program and why? Each of these
information requirements leads to different statistical analysis
procedures.

Generally, both descriptive and inferential statistics should be
reported. Both instruments and the kinds of data they provided
should be described before making judgments about program effects.
This preliminary data analysis will include instrument statistics
such as inter-rater reliability for all constructed response
measures, internal consistency, and perhaps test-retest (pre- and
post) reliability for achievement tests, decision or
classification consistency for criterion-referenced measures, item
difficulties, as well as group means and standard deviations.
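
One of the instrument statistics named above, internal consistency, is often summarized with Cronbach's alpha. The paper does not say which coefficient to use, so treat the sketch below as one possible choice; the function name and the small matrix of item scores (one row per pupil, one column per item) are hypothetical.

```python
from statistics import pvariance


def cronbach_alpha(item_scores):
    """Cronbach's alpha for a table with one row per pupil and one
    column per test item."""
    k = len(item_scores[0])                      # number of items
    columns = list(zip(*item_scores))            # scores grouped by item
    item_variance_sum = sum(pvariance(col) for col in columns)
    total_variance = pvariance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - item_variance_sum / total_variance)


# Hypothetical right/wrong (1/0) scores for five pupils on four items.
responses = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 1, 0, 0],
]
alpha = cronbach_alpha(responses)
print(round(alpha, 3))
```

The function assumes at least two items and some variation in total scores; with fewer items or identical totals the formula is undefined and the code would raise a division error.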

In addition to descriptive statistics about your instruments, you
will want to provide intercorrelations among dependent variables
as an indication of their similarities and differences. You will
also examine all the data collected in order to answer the
following:




1. What general patterns emerge that can be examined with
inferential statistics?

2. Are there missing data, unusual frequency distributions,
small sample sizes, or restricted variance that will
affect the interpretation of evaluation results?

3. Do the data to be used for making inferences meet the
requirements for inferential statistics?

Deciding on which kind(s) of statistical analysis technique to
use is often seen as an overwhelming problem, beyond the scope of
the project staff. Perhaps this is because of the esoteric
language often used in discussion of statistics. I'm not saying
that statistical analysis is easy, but it need not be bothersome.
If project staff do not have expertise in statistical analysis,
then they should look for expert help. If they do look for expert
help, then it is critical that the staff have some background in
the use of statistics so that they will know the right questions
to ask of the expert, to assure the most appropriate statistical
analysis, given the individual project, its constraints, and its
information needs.

The selection of an appropriate statistical technique is
governed by (1) the data the project is able to collect, and (2)
the variables it is examining.

A project can normally collect three kinds of data: nominal,
ordinal, or interval. Nominal data means the information names
something; it doesn't measure, it only gives names. The
information is classified into categories with no necessary
relationship between those categories. For example, we might say
that a classroom appears to have some happy versus unhappy
children.




Ordinal data means the information has been ordered according
to rank, with a categorization of things in terms of more than or
less than. For example, we might have a measure that rank-orders
degrees of student self-esteem.

Interval data not only tell the order of things, but also 
tell the interval or difference between the judgments. For exam- 
ple, if one pupil scores 87 and another scores 77, the first stu- 
dent has not only performed better, but has also scored better by 
10 points. Rating scales and achievement tests are examples of 
devices that give interval data.

Data from one scale can, if necessary, be converted downwards.
That is, interval data can be legitimately converted to ordinal
or nominal. Data, however, should not be converted upwards, for
example, from ordinal to interval.
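A short sketch may make the direction of conversion concrete. It takes interval scores (including the 87 and 77 from the example above), converts them down to ranks, and then down again to named categories. The function names, the extra scores, and the cutoff of 80 are hypothetical.

```python
def to_ranks(scores):
    """Interval -> ordinal: rank 1 is the highest score; tied scores share the average rank."""
    ordered = sorted(scores, reverse=True)
    rank_of = {}
    for score in set(scores):
        first_position = ordered.index(score) + 1
        tie_count = ordered.count(score)
        rank_of[score] = first_position + (tie_count - 1) / 2
    return [rank_of[score] for score in scores]

def to_categories(scores, cutoff):
    """Interval -> nominal: label each score relative to an illustrative cutoff."""
    return ["at or above cutoff" if s >= cutoff else "below cutoff" for s in scores]

scores = [87, 77, 92, 77]           # hypothetical interval data
print(to_ranks(scores))             # [2.0, 3.5, 1.0, 3.5]
print(to_categories(scores, 80))
```

Once only the ranks are kept, the 10-point difference between 87 and 77 can no longer be recovered, which is why conversion in the upward direction is not legitimate.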

An evaluation might cover any of the following variables:
independent, dependent, or moderator. An independent variable is
the stimulus variable or input. It is the thing that is examined
to see its relation to something (e.g., a test score) that we
observe. Program and educational interventions are examples of
independent variables.

A dependent variable is the response variable or output. It
must be observed and measured to find the effect, or the
accomplishments, of the independent variable such as a program
component. Student performance on an achievement test is an
example of a dependent variable.

A moderator variable is a special kind of independent variable
that may modify the relationship between the independent and
dependent variables. Student socio-economic status is an example
of a possible moderator variable. (For statistical purposes,
moderator variables should be considered to be independent
variables.)
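One informal way to look for a moderator is to compare the program effect separately within each level of the suspected moderator. The sketch below does this for hypothetical records in which program status is the independent variable, a test score is the dependent variable, and socio-economic status (SES) is the possible moderator. All names and numbers are invented for illustration.

```python
from collections import defaultdict
from statistics import mean

def cell_means(records):
    """records: iterable of (program_status, ses_level, score) tuples."""
    cells = defaultdict(list)
    for program_status, ses_level, score in records:
        cells[(program_status, ses_level)].append(score)
    return {cell: mean(values) for cell, values in cells.items()}

# Hypothetical scores, two students per cell.
records = [
    ("program", "low", 70), ("program", "low", 74),
    ("comparison", "low", 60), ("comparison", "low", 64),
    ("program", "high", 80), ("program", "high", 82),
    ("comparison", "high", 78), ("comparison", "high", 80),
]

means = cell_means(records)
for ses_level in ("low", "high"):
    difference = means[("program", ses_level)] - means[("comparison", ses_level)]
    print(f"SES {ses_level}: program minus comparison = {difference}")
```

If the program-minus-comparison difference varies noticeably across SES levels, SES may be moderating the relationship; whether that variation is more than chance would still need an inferential test.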

Statistical analyses can be parametric or nonparametric.
Parametric analyses, in short, make certain assumptions about the
data and make strict demands. For example, use of a parametric
technique might be based on the assumption that student scores
were drawn from a normally distributed population, that is, the
normal, bell-shaped curve, or that both sets of scores were drawn
from a population having the same variance, that is, the same
spread of scores. Using a parametric technique assumes the
existence of interval data. Nonparametric techniques, on the other
hand, do not require that data be normally distributed or that the
sample variances be equal.

Frequently used parametric techniques include the t-Test,
parametric correlation (Pearson product-moment), and analysis of
variance. Frequently used nonparametric techniques, corresponding
in use to the three parametric techniques mentioned above, are the
Mann-Whitney U-Test (a kind of nonparametric t-Test); Spearman
rank-order correlation; and Chi square.
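To show how a parametric technique and its nonparametric counterpart treat the same data, the sketch below computes an independent-samples t statistic (equal-variance form) and a Mann-Whitney U statistic for two small groups. The scores and function names are hypothetical, and only the test statistics are computed; judging significance would still require the appropriate tables or a statistics package.

```python
from math import sqrt
from statistics import mean, variance

def pooled_t(group_a, group_b):
    """Independent-samples t statistic, assuming equal population variances."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_variance = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_variance * (1 / n_a + 1 / n_b))

def mann_whitney_u(group_a, group_b):
    """U statistic for group_a: pairs in which the group_a score is higher; ties count half."""
    u = 0.0
    for a in group_a:
        for b in group_b:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u

# Hypothetical posttest scores.
program_scores = [72, 85, 78, 90, 66]
comparison_scores = [70, 75, 64, 80, 61]

print("t statistic:", round(pooled_t(program_scores, comparison_scores), 3))
print("U statistic:", mann_whitney_u(program_scores, comparison_scores))
```

The t statistic uses the size of the score differences, so it needs interval data; U uses only which score in each pair is higher, so it also works with ordinal data.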

These kinds of tests are intended to show the statistical
significance of something; for example, the difference between a
pre- and a post-measure of a program component. They show whether
the differences seem to be the result of chance or whether they
are more likely to be the result of the program. The greater the
level of statistical significance, the more likely it is that the
differences are the result of the program.
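The idea of asking whether a pre/post difference could be the result of chance can be sketched with a small exact test. The example below is a sign-flip permutation test on paired gains, a nonparametric procedure that is not one of the techniques named in this paper; it is used here only because it needs no statistical tables. The scores and function name are hypothetical.

```python
from itertools import product

def sign_flip_p_value(pre_scores, post_scores):
    """One-sided exact p-value for paired gains.

    Under the chance-only hypothesis each gain is equally likely to be
    positive or negative, so every pattern of signs is enumerated and the
    share with a total gain at least as large as the observed one is returned.
    """
    gains = [post - pre for pre, post in zip(pre_scores, post_scores)]
    observed_total = sum(gains)
    as_extreme = 0
    patterns = 0
    for signs in product((1, -1), repeat=len(gains)):
        patterns += 1
        if sum(sign * gain for sign, gain in zip(signs, gains)) >= observed_total:
            as_extreme += 1
    return as_extreme / patterns

# Hypothetical pre- and post-measures for four students.
pre_scores = [10, 12, 9, 11]
post_scores = [13, 15, 11, 12]
print(sign_flip_p_value(pre_scores, post_scores))  # 0.0625
```

A small p-value indicates only that chance is an unlikely explanation for the gains. Whether the program, rather than some competing influence, produced them depends on the evaluation design and the threats to validity it counters.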

Selection of the appropriate statistical technique will de- 
pend on the kind (nominal, ordinal, interval) and number of depen- 
dent variables we have and the kind and number of independent var- 
iables. Since the mix of variables and kind of data sometimes in- 
dicates only one appropriate technique, and sometimes offers 
alternatives to select from, decisions about the whys and where- 
fores of analyses should involve discussion between staff and 
evaluator. 
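The pairing of data kind and analysis purpose with the techniques named earlier can be written as a simple lookup, which may be useful as a starting point for that discussion. The table below is an illustrative simplification built only from the techniques mentioned in this paper; the function name and the two purpose labels are assumptions, and the number of variables and the parametric assumptions would still have to be weighed by staff and evaluator.

```python
def suggest_technique(data_kind, purpose):
    """Suggest a technique from the ones named in the text.

    data_kind: 'nominal', 'ordinal', or 'interval' (scale of the dependent variable).
    purpose:   'compare_groups' or 'relate_variables' (hypothetical labels).
    """
    table = {
        ("interval", "compare_groups"): "t-Test (two groups) or analysis of variance",
        ("interval", "relate_variables"): "Pearson product-moment correlation",
        ("ordinal", "compare_groups"): "Mann-Whitney U-Test",
        ("ordinal", "relate_variables"): "Spearman rank-order correlation",
        ("nominal", "compare_groups"): "Chi square",
        ("nominal", "relate_variables"): "Chi square",
    }
    try:
        return table[(data_kind, purpose)]
    except KeyError:
        raise ValueError(f"no suggestion for {data_kind!r} data and purpose {purpose!r}")

print(suggest_technique("ordinal", "compare_groups"))
print(suggest_technique("interval", "relate_variables"))
```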

The Range of Designs

Figure 1 lists a series of possible evaluation designs, the
kinds of comparative data they generate, and the requirements that
cut across designs. Figure 2 shows the kinds of threats to
validity for which each design is intended to compensate.

Designs Recommended for Bilingual Programs 

The following are some suggested designs suitable for
gathering information related to the major questions most often
addressed in bilingual evaluation. Once their strengths and
weaknesses have been examined in relation to the particular
evaluation concerns at hand, the designs can be adapted and/or
combined as needed.



2 Use of these designs is amplified in Winters et al., 1980.






Figure 1

REQUIREMENTS OF TYPICAL DESIGNS

Design                              Comparative Data Obtained

Pretest/Posttest Control Group      Equivalent Group
Posttest-Only Control Group         Equivalent Group
Single Cohort Comparison            Equivalent Group
Multiple Cohort Comparison          Equivalent Group
Nonequivalent Control Group         Nonequivalent Group
Regression Discontinuity            Nonequivalent Group
Regression Projection               Nonequivalent Group
Normed Pretest/Posttest             National Norm Group
Normed Pretest, Local               National Norm Group
  Pretest/Posttest
Minimum Competency                  Prespecified Achievement
                                      Standard
Time Series/Longitudinal            Within Group
Multiple Time Series                Within Group
Exposure to Treatment               Program to Itself

Requirements for all Models

1. Theory/knowledge of developmental rate of skill being
   examined.

2. Evidence that the program existed.

3. Description of program and sample "mortality" with reasons, if
   possible.

4. Documentation of activities that may produce competing
   explanations of results.

5. Valid and reliable instruments.

6. Explicit criteria by which program "success" will be judged.



Figure 2

RELATIVE ADVANTAGES OF EVALUATION DESIGNS
(Threats to Validity Countered by Design)

[The cell entries of this matrix are not recoverable from the
source. Columns, grouped by type of threat: Sample (Selection,
Maturity, Mortality); Instrumentation (Validity, Reliability,
Administration Procedures); a third group whose heading is
illegible (Concurrent Programs, Hawthorne Effects). Rows: the
thirteen designs listed in Figure 1.]




The four design types suggested for bilingual programs are
summarized in Figure 3. Each type was selected to address a
different, but frequently encountered, bilingual evaluation
purpose. The designs are presented in order of utility, from
general to specific. The most general-purpose designs, for
investigating program outcomes, are the time-series/longitudinal
designs. The exposure-to-treatment design is widely applicable for
formative evaluations. Accountability designs for reporting
information about student achievement, either in terms of local
program objectives or national norms (when required), are also
presented.

For each design, the figure also specifies the comparison
implied by the design as well as the requirements that need to be
met if the design is to be used.

CONCLUSION

As evaluators of bilingual programs begin to plan, they must
consider the various agencies (federal, state, and local) that
direct the program. The legitimate differences that exist among
programs make it inappropriate to apply one set of design or test
criteria to all bilingual programs. The evaluator must consider
the particular needs of the program evaluated, recognize the
inherent difficulties in design and testing, and plan an evaluation
appropriate to the program and feasible within its constraints.

It is critical that the goals and objectives of a bilingual 
education program be made explicit so that valid criteria for de- 
termining program success or failure can be established. The goal 






Figure 3

RECOMMENDED DESIGNS FOR BILINGUAL PROGRAMS

Evaluation Purpose: All-purpose design to provide documentation
  of program development, incrementation, and student achievement
  over time.
Design: Time Series/Longitudinal
Implied Comparison: Within group or age/grade cohorts
Requirements of the Design:
  1. Way to trace/document departure of students from bilingual
     or comparison groups to explain "mortality."
  2. Tests/instruments that provide comparable data over time,
     e.g., same test with many levels and alternate forms,
     observation scales that remain the same across the period of
     the program evaluation, or highly correlated [...]

Evaluation Purpose: Identifying program strengths and weaknesses
  in relation to student achievement.
Design: Exposure to Treatment
Implied Comparison: Parts of the program with each other
Requirements of the Design:
  1. Identified theory of what variables contribute to bilingual
     student achievement.
  2. Achievement instruments that measure program goals.
  3. Data gathering techniques that identify qualities of program
     implementation such as time on task, amount of curriculum
     covered, etc.

Evaluation Purpose: Accountability to school or district
  administration.
Design: Minimum Competency
Implied Comparison: To prespecified minimal performance standard
  and/or to students in district outside the bilingual program.
Requirements of the Design:
  1. Basic skill/minimal competencies that bilingual students are
     responsible for completing.
  2. Reliable and valid minimal competency test.
  3. Defensible minimal competency standard of excellence.

Evaluation Purpose: Reporting required norm-referenced test data
  for Title VII programs.
Design: Normed Pretest, Local Pretest/Posttest
Implied Comparison: National norm group. Other students taking
  local pretest and posttest.
Requirements of the Design:
  1. Norm-referenced instrument appropriate for measuring
     bilingual program goals.
  2. Reliable and valid local criterion-referenced test that
     matches program goals.
  3. (Optional) local norms for criterion-referenced test.



statements should include outcomes beyond those specifically
concerned with student achievement.

Evaluators working in a bilingual setting must also become
familiar with a variety of program features frequently absent in a
unilingual setting. They must also build into the evaluation plan
a variety of formative tasks. These tasks should include
examination of program implementation to suggest areas of program
improvement and careful documentation of the implemented features
and their relationships. Evaluators must also attempt to improve
their summative evaluation activities, for example, in their
estimations of which program features contribute to program
outcomes.

Evaluators of bilingual programs must ensure that the
appropriate program features or variables are selected for
evaluation. The evaluation itself (design, measures, analyses, and
reporting) should be technically sound, acceptable under existing
regulations, and sensitive to the needs and constraints of the
program under examination.

These aims can be facilitated by: reviewing the operant
regulations; examining the program plan or description of the
program (if such a plan or description exists); determining (in
consultation with program participants) which features of the
program need to be evaluated; deciding upon the means to evaluate
these features; selecting methods to ensure ongoing documentation
of their implementation, relationships, and cumulative effects;
and considering appropriate means of reporting evaluation
information to various audiences.



Evaluation planning will be more feasible and effective the
earlier the evaluator is involved in the program. If possible,
evaluator involvement should occur before program implementation
so that the evaluation plan is an integral part of the program's
operation. The kind of evaluation planning described here also
relies upon a program plan or description. Such program
descriptions may not provide sufficient information to permit
adequate evaluation planning without discussions with program
staff to elaborate intentions, achievement strategies, means of
implementation, and expected outcomes. In short, the kind of
evaluation planned will be influenced by the time at which the
evaluator enters the program and the existence of observable and
measurable program processes and outcomes.

How smoothly a bilingual program's evaluation is managed
depends on the care devoted to its planning. Information use will
be enhanced to the extent that it is reported quickly to the
appropriate decision makers.

Bilingual education programs are extremely complex and address
a great variety of needs and methods. Evaluation should provide
greater documentation of bilingual programs in terms of their
intentions and their implementation. It must clarify exactly what
a particular program is to accomplish and how the program intends
to meet its goals. Tests and other measures must be selected
according to technical properties and relevance to the individual
bilingual program. Test results should be interpreted in light of
the program's objectives. In evaluation design, variables must be
specified and controlled. Design features should allow for
information to improve the program (formative evaluation) and to
show program effect (summative evaluation).

These procedures, if implemented at the level of the local
project, will lead to more useful evaluations. Over time, they
should lead to an evaluation strategy consisting of: (1)
information on pupil performance before entering the program; (2)
information on how much instruction they received in the program,
the manner in which they received it, and the context in which it
was provided; and (3) information on pupil performance after the
program. Gains at the end of the program could then be attributed
to instruction of a certain kind.
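The three kinds of information in this strategy map naturally onto one record per pupil. The sketch below is a minimal illustration under assumed field names and an assumed cutoff for hours of instruction; it compares average gains for pupils with lower and higher exposure, in the spirit of the exposure-to-treatment design. The manner and context of instruction, which the strategy also calls for, would need additional fields.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class PupilRecord:
    pre_score: float             # (1) performance before entering the program
    hours_of_instruction: float  # (2) how much instruction was received
    post_score: float            # (3) performance after the program

    @property
    def gain(self):
        return self.post_score - self.pre_score

def mean_gain_by_exposure(records, hours_cutoff):
    """Average gain for pupils below, and at or above, an illustrative hours cutoff."""
    lower = [r.gain for r in records if r.hours_of_instruction < hours_cutoff]
    higher = [r.gain for r in records if r.hours_of_instruction >= hours_cutoff]
    return {
        "lower exposure": mean(lower) if lower else None,
        "higher exposure": mean(higher) if higher else None,
    }

# Hypothetical records for four pupils.
records = [
    PupilRecord(40, 20, 45),
    PupilRecord(42, 60, 55),
    PupilRecord(38, 25, 41),
    PupilRecord(41, 70, 56),
]
print(mean_gain_by_exposure(records, hours_cutoff=50))
```

A larger average gain among pupils with more exposure is the kind of pattern that would support attributing gains to instruction, subject to the design and analysis cautions discussed earlier.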






REFERENCES

Baker, E. L., L. P. Polin, James Burry, and C. Walker. Making,
Choosing, and Using Tests: A Practicum on Domain-Referenced
Testing. Los Angeles, California: Center for the Study of
Evaluation, University of California, Los Angeles, 1980.

Burry, James. Evaluation and Documentation: Making Them Work
Together. Los Angeles, California: Center for the Study of
Evaluation, University of California, Los Angeles, 1981.

Burry, James. "Evaluation in Bilingual Education," Evaluation
Comment, VI, No. 1 (1979), 1-14.

Cronbach, L. J., and P. Suppes, eds. Research for Tomorrow's
Schools: Disciplined Inquiry for Education. New York:
Macmillan, 1969.

Federal Register. Washington, D.C.: Government Printing
Office, March 29, 1979.

Winters, L., L. Spooner-Smith, and D. Elvenstar. Planning for
Evaluation Design. Los Angeles, California: Center for the
Study of Evaluation, University of California, Los Angeles,
1980.






